We have acquired data about how much audiences and critics like movies as well as numerous other variables about the movies. This dataset includes information from Rotten Tomatoes and IMDb for a random sample of movies.
We’re interested in learning what attributes make a movie popular.
The data set comprises 651 randomly sampled movies produced and released before 2016 and scored on both Rotten Tomatoes and IMDb.
This is a retrospective observational study with random sampling and no random assignment. This means we cannot draw causal conclusions from the data. However, we can make correlational statements, and the results of our findings can be generalized to the whole population of movies released before 2016 and scored on both Rotten Tomatoes and IMDb.
Obviously, the popularity of a movie is mainly based on its content and the artistic value it offers. Those two factors are very hard, if not impossible, to quantify. However, there are certain characteristics of a movie that we can measure and classify comparatively easily, among them genre, duration, and cast quality. Using those and other variables from the data frame, we'll try to build a prediction model for movie popularity among the audience.
We start our analysis with a high-level overview of the data frame provided. The table below shows the number of movies of three major types (Documentary, Feature Film, and TV Movie) in the data set, the time range of theatre releases (thtr_rel_year), as well as the median audience rating on both IMDb (imdb_rating) and Rotten Tomatoes (audience_score). We choose the median instead of the mean for the audience rating because we suspect the rating distributions are skewed, and the median is a more robust statistic for skewed distributions.
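As a quick self-contained illustration of that robustness (toy synthetic data, not the movie sample):

```r
# Toy illustration: for a left-skewed distribution the mean is dragged
# toward the long lower tail, while the median stays near the bulk.
set.seed(42)
scores <- 10 - rexp(10000, rate = 1)  # left-skewed, long lower tail
mean(scores)    # pulled down by the tail
median(scores)  # closer to a "typical" value
```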
movies %>%
group_by(title_type) %>%
summarise(total = n(), min(thtr_rel_year), max(thtr_rel_year), median(imdb_rating), median(audience_score)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed", "responsive"), full_width = F)

| title_type | total | min(thtr_rel_year) | max(thtr_rel_year) | median(imdb_rating) | median(audience_score) |
|---|---|---|---|---|---|
| Documentary | 55 | 1970 | 2012 | 7.7 | 86 |
| Feature Film | 591 | 1972 | 2014 | 6.5 | 62 |
| TV Movie | 5 | 1993 | 2012 | 7.3 | 75 |
Table 1. Movie statistics grouped by title type
We can see that the majority of movies in the data set (591) belong to the 'Feature Film' group; 55 movies are documentaries and only 5 are TV movies.
It's expected that the popularity of movies in these three major groups depends on very different factors. This is partially confirmed by the significant difference between the median rating of Feature Films and the two other groups. Thus, we decided to deal only with the most populous group of movies in this data set, Feature Films. We then create a subset (ff, for Feature Film) of the data frame and use it from now on:
Let’s take a look at the data summary now (we exclude some variables from the summary to make the report shorter):
## genre runtime mpaa_rating
## Drama :301 Min. : 68.0 G : 16
## Comedy : 85 1st Qu.: 93.0 NC-17 : 2
## Action & Adventure: 65 Median :104.0 PG :110
## Mystery & Suspense: 59 Mean :106.6 PG-13 :130
## Horror : 23 3rd Qu.:116.0 R :317
## Other : 15 Max. :202.0 Unrated: 16
## (Other) : 43
## studio thtr_rel_year thtr_rel_month
## Paramount Pictures : 37 Min. :1972 Min. : 1.000
## Warner Bros. Pictures : 29 1st Qu.:1990 1st Qu.: 4.000
## Sony Pictures Home Entertainment: 26 Median :1999 Median : 7.000
## Universal Pictures : 23 Mean :1997 Mean : 6.783
## Warner Home Video : 19 3rd Qu.:2006 3rd Qu.:10.000
## (Other) :451 Max. :2014 Max. :12.000
## NA's : 6
## thtr_rel_day dvd_rel_year dvd_rel_month dvd_rel_day
## Min. : 1.00 Min. :1991 Min. : 1.000 Min. : 1.00
## 1st Qu.: 7.00 1st Qu.:2001 1st Qu.: 3.000 1st Qu.: 7.00
## Median :15.00 Median :2003 Median : 6.000 Median :15.00
## Mean :14.48 Mean :2004 Mean : 6.304 Mean :15.06
## 3rd Qu.:21.50 3rd Qu.:2007 3rd Qu.: 9.000 3rd Qu.:23.00
## Max. :31.00 Max. :2015 Max. :12.000 Max. :31.00
## NA's :6 NA's :6 NA's :6
## imdb_rating imdb_num_votes critics_rating critics_score
## Min. :1.900 Min. : 390 Certified Fresh:116 Min. : 1.00
## 1st Qu.:5.850 1st Qu.: 6276 Fresh :172 1st Qu.: 31.00
## Median :6.500 Median : 17934 Rotten :303 Median : 57.00
## Mean :6.387 Mean : 62861 Mean : 54.78
## 3rd Qu.:7.100 3rd Qu.: 66112 3rd Qu.: 79.00
## Max. :9.000 Max. :893008 Max. :100.00
##
## audience_rating audience_score best_pic_nom best_pic_win best_actor_win
## Spilled:273 Min. :11.00 no :569 no :584 no :500
## Upright:318 1st Qu.:44.50 yes: 22 yes: 7 yes: 91
## Median :62.00
## Mean :60.47
## 3rd Qu.:78.00
## Max. :97.00
##
## best_actress_win best_dir_win top200_box
## no :521 no :548 no :576
## yes: 70 yes: 43 yes: 15
##
##
##
##
##
The variables of most interest relate to movie ratings: the audience rating on IMDb (imdb_rating), the critics score on Rotten Tomatoes (critics_score), and the audience score on Rotten Tomatoes (audience_score). For all three variables, the median is higher than the mean, which supports our earlier assumption about the skewness of the rating distributions. In fact, all three distributions are left skewed. This can also be demonstrated with the following histograms:
Interestingly, the IMDb rating shows a unimodal left-skewed distribution centered around 6.4 (Plot 1), while both scores on Rotten Tomatoes show a nearly uniform distribution with a slight left skew and no apparent center (Plots 2 and 3). This discrepancy could be explained by the larger number of voting users on IMDb, making the true population mean more apparent. This is speculative, though, as we don't have data on the number of votes behind the Rotten Tomatoes scores.
We can also conclude that IMDb users are less likely to give a movie a low rating than Rotten Tomatoes users and critics. Indeed, 75% of movies received a rating of 58.5 or higher on IMDb (5.85 on its scale of 1 to 10), 44.5 or higher from the audience on RT, and only 31 or higher from the critics on RT.
However, the most plausible cause of the difference in distributions lies in how the two platforms calculate their ratings and scores:
Basically, the Rotten Tomatoes score counts only positive ratings, while the IMDb rating averages all ratings.
For example, an audience score of 25% on Rotten Tomatoes can be translated to an IMDb rating between 2.5 and 7.5 depending on individual ratings that made up that 25% score on RT. In other words, in the plots above we do not compare apples to apples (or we should say tomatoes to tomatoes).
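This tomatoes-to-tomatoes problem is easy to demonstrate with a small sketch. The votes below are hypothetical, and the "positive means 7 or higher" threshold is our assumption for illustration, not RT's published rule:

```r
# Two hypothetical sets of individual ratings on a 1-10 scale; both have
# exactly 25% "positive" votes (>= 7 here, an assumed threshold).
votes_a <- c(1, 1, 1, 1, 1, 1, 7, 7)     # negative voters rate very low
votes_b <- c(6, 6, 6, 6, 6, 6, 10, 10)   # negative voters just miss the bar

rt_style_score    <- function(v) 100 * mean(v >= 7)  # percent positive
imdb_style_rating <- function(v) mean(v)             # plain average

rt_style_score(votes_a)     # 25
rt_style_score(votes_b)     # 25
imdb_style_rating(votes_a)  # 2.5
imdb_style_rating(votes_b)  # 7
```

The same 25% "score" corresponds to wildly different average ratings, which is exactly the ambiguity described above.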
Let’s take a look at the median statistics for feature films grouped by genre:
ff %>%
group_by(genre) %>%
summarise(total = n(), IMDb = median(imdb_rating), RT_critics = median(critics_score), RT_audience = median(audience_score)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed", "responsive"))

| genre | total | IMDb | RT_critics | RT_audience |
|---|---|---|---|---|
| Action & Adventure | 65 | 6.00 | 33 | 52.0 |
| Animation | 9 | 6.40 | 48 | 65.0 |
| Art House & International | 14 | 6.50 | 52 | 65.5 |
| Comedy | 85 | 5.70 | 36 | 49.0 |
| Documentary | 3 | 6.90 | 40 | 74.0 |
| Drama | 301 | 6.80 | 67 | 70.0 |
| Horror | 23 | 5.90 | 40 | 43.0 |
| Musical & Performing Arts | 8 | 7.25 | 80 | 82.5 |
| Mystery & Suspense | 59 | 6.50 | 60 | 54.0 |
| Other | 15 | 7.00 | 72 | 74.0 |
| Science Fiction & Fantasy | 9 | 5.90 | 67 | 47.0 |
Table 2. Feature Film rating and score medians by genre
We can see that Drama is the most populous category, with more than half of the movies in the data set, a median IMDb rating of 6.80, an RT critics score of 67, and an RT audience score of 70. The highest median IMDb rating, 7.25, belongs to the Musical & Performing Arts category, which also has the highest RT scores among both critics and the audience. Note that only 8 movies fall into this group, so these statistics could look quite different in a larger sample. The Comedy group has the lowest median IMDb rating of all, 5.70, the second lowest for RT critics, and the third lowest for the RT audience.
It's interesting to see the distribution of movie runtimes:
Here we can see a right-skewed unimodal distribution centered around 100 minutes with several outliers.
The data set contains several variables on movies: title, runtime, date of release, production company, cast, nominations, ratings, and scores. We'd like to predict the popularity of a movie (represented by its IMDb rating) from some combination of the remaining variables using multiple linear regression.
We will start with modeling where the response variable is IMDb rating (imdb_rating).
Developing a full model as a reference for all future models would be a good first step. The full model includes all potential predictor variables. However, we omit some variables that are in the data set for informational purposes only and make no sense to include in the statistical analysis.
Here’s a list of variables we can omit:
- The director and actor1 through actor5 variables were only used to determine whether a movie features a director, actor, or actress who won an Oscar.
- imdb_url and rt_url obviously cannot be associated with the popularity of a movie, so they should be omitted.
- The title of a movie is by itself meaningless for answering the research question (although we could try using title length as a predictor).
- The title_type variable is dropped because we only work with Feature Films here.
- The day of a theatre (thtr_rel_day) or DVD (dvd_rel_day) release is also meaningless for predicting a movie's popularity. We can imagine some correlation between the release month or year and popularity, although we suspect it is weak if present at all.
- The critics_rating and audience_rating variables are essentially derivatives of critics_score and audience_score respectively. They are obviously collinear, so adding more than one of each pair would not add much value to the model.
- Of the best_pic_nom and best_pic_win variables, we keep only one: a movie can't win an Oscar without being nominated, so we remove best_pic_win.

The remaining question is whether we should use Rotten Tomatoes scores to predict the IMDb rating and vice versa. Technically speaking, a popular movie should rate high on both platforms (although that is not always the case). This means the IMDb rating should have a positive correlation with the RT audience score despite the different calculation methods. We can check whether this is the case by calculating a correlation coefficient:
cor(ff$imdb_rating, ff$audience_score)

## [1] 0.8487537
ff %>%
ggplot(aes(y = imdb_rating*10, x = audience_score)) +
geom_jitter(col = 'orange') +
labs(title = 'Plot 5. Correlation plot for IMDb rating and RT audience score') +
  geom_smooth(method = lm)

plot_ss(y = imdb_rating*10, x = audience_score, data = ff)
title(main = "Plot 6. Plot of residuals")

## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 36.5493 0.4519
##
## Sum of Squares: 18380.56
## integer(0)
m_obs <- lm(imdb_rating ~ audience_score, data = ff)
ggplot(data = m_obs, aes(x = .fitted, y = .resid)) +
geom_jitter() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Plot 7. Residuals vs fitted", xlab = "Fitted values", ylab = "Residuals")ggplot(data = m_obs, aes(x = .resid)) +
geom_histogram(binwidth = 0.5, fill = "orange", colour = "black") +
xlab("Residuals") +
ggtitle("Plot 8. Residulas distribution")ggplot(data = m_obs, aes(sample = .resid)) +
stat_qq(col = "orange") +
ggtitle("Plot 9. Normal probability plot of the residuals") +
  geom_abline(colour = "blue")

Indeed, we can see that the correlation coefficient is very high (~0.85) and positive. However, the variance does not appear constant (small for popular movies, larger for unpopular ones) for the reason discussed in Part 3. There is definitely an association between these two variables, but it is not linear.
This pattern is even more apparent with the RT critics score and critics rating variables plotted against the IMDb rating (Plots 10-13):
ff %>%
ggplot(aes(y = imdb_rating*10, x = critics_score)) +
geom_jitter(col = 'orange', alpha = 0.7) +
labs(title = 'Plot 10. Correlation plot for IMDb rating and RT critics score') +
  geom_smooth(method = lm)

m_obs3 <- lm(imdb_rating ~ critics_score, data = ff)
ggplot(data = m_obs3, aes(x = .fitted, y = .resid)) +
geom_jitter(col = 'orange', alpha = 0.7) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Plot 11. Residuals vs fitted for imdb_rating/critics_score pair", x = "Fitted values", y = "Residuals")ff %>%
ggplot(aes(y = imdb_rating*10, x = critics_rating)) +
geom_jitter(col = 'orange', alpha = 0.7) +
labs(title = 'Plot 12. Correlation plot for IMDb rating and RT critics rating') +
  geom_smooth(method = lm)

m_obs4 <- lm(imdb_rating ~ critics_rating, data = ff)
ggplot(data = m_obs4, aes(x = .fitted, y = .resid)) +
geom_jitter(col = 'orange', alpha = 0.7) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Plot 13. Residuals vs fitted for imdb_rating/critics_rating pair", x = "Fitted values", y = "Residuals")The funnels we see is a perfect illustration of non-constant variability described earlier. Thus we shouldn’t be using RT scores in predicting IMDb rating using multiple linear regression.
The full model should then look like this:
imdb_full <- lm(data = na.omit(ff), imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year + thtr_rel_month + dvd_rel_year + dvd_rel_month + imdb_num_votes + best_pic_nom + best_actor_win + best_actress_win + best_dir_win + top200_box)
summary(imdb_full)

##
## Call:
## lm(formula = imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year +
## thtr_rel_month + dvd_rel_year + dvd_rel_month + imdb_num_votes +
## best_pic_nom + best_actor_win + best_actress_win + best_dir_win +
## top200_box, data = na.omit(ff))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8520 -0.4062 0.0699 0.5605 2.0753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.871e+01 1.701e+01 2.863 0.004358 **
## genreAnimation -2.379e-01 3.530e-01 -0.674 0.500658
## genreArt House & International 9.488e-01 2.802e-01 3.386 0.000761 ***
## genreComedy -3.070e-02 1.450e-01 -0.212 0.832405
## genreDocumentary 1.029e+00 5.070e-01 2.030 0.042886 *
## genreDrama 6.849e-01 1.247e-01 5.494 6.04e-08 ***
## genreHorror -9.927e-02 2.177e-01 -0.456 0.648561
## genreMusical & Performing Arts 1.186e+00 3.200e-01 3.708 0.000230 ***
## genreMystery & Suspense 4.323e-01 1.625e-01 2.660 0.008038 **
## genreOther 4.802e-01 2.486e-01 1.932 0.053930 .
## genreScience Fiction & Fantasy -8.811e-02 3.179e-01 -0.277 0.781739
## runtime 5.149e-03 2.620e-03 1.965 0.049896 *
## mpaa_ratingNC-17 -2.281e-01 8.845e-01 -0.258 0.796598
## mpaa_ratingPG -6.047e-01 2.600e-01 -2.325 0.020413 *
## mpaa_ratingPG-13 -8.087e-01 2.708e-01 -2.986 0.002952 **
## mpaa_ratingR -5.297e-01 2.628e-01 -2.015 0.044346 *
## mpaa_ratingUnrated 1.343e-01 3.551e-01 0.378 0.705479
## thtr_rel_year -1.095e-02 4.817e-03 -2.274 0.023364 *
## thtr_rel_month 7.542e-03 1.068e-02 0.706 0.480371
## dvd_rel_year -1.062e-02 1.052e-02 -1.009 0.313386
## dvd_rel_month 2.041e-02 1.069e-02 1.909 0.056754 .
## imdb_num_votes 3.500e-06 3.682e-07 9.504 < 2e-16 ***
## best_pic_nomyes 3.665e-01 2.040e-01 1.797 0.072954 .
## best_actor_winyes 1.472e-02 1.040e-01 0.142 0.887487
## best_actress_winyes -1.733e-02 1.141e-01 -0.152 0.879318
## best_dir_winyes 2.011e-01 1.412e-01 1.424 0.155037
## top200_boxyes -9.112e-02 2.378e-01 -0.383 0.701758
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.841 on 546 degrees of freedom
## Multiple R-squared: 0.3896, Adjusted R-squared: 0.3605
## F-statistic: 13.4 on 26 and 546 DF, p-value: < 2.2e-16
The full model includes all variables that we believe are meaningful. It has an adjusted R2 of 0.3605, which means the model explains approximately 36% of the variance in the response variable (imdb_rating in this case). Let's see if we can come up with a model of higher predictive power by changing the combination of predictors.
We’ll be using a forward selection technique here. The forward selection model starts with an empty model. Then, we add variables one-at-a-time until we cannot find any variables that improve the model (as measured by adjusted R2).
# Forward selection based on adjusted R squared
# Create a subset of the data that only includes the variables we're interested in
ff_imdb <- subset(na.omit(ff), select = c("imdb_rating", "genre", "runtime", "mpaa_rating", "thtr_rel_year", "thtr_rel_month", "dvd_rel_year", "dvd_rel_month", "imdb_num_votes", "best_pic_nom", "best_actor_win", "best_actress_win", "best_dir_win", "top200_box"))

bestformula <- "imdb_rating ~"  # starting point: empty model
result <- 0       # best adjusted R^2 among candidates at the current step
bestresult <- 0   # best adjusted R^2 of the accepted model so far
cycle <- 2        # start with the 2nd column of the data frame
while (cycle < ncol(ff_imdb)) {
  i <- 2  # candidate predictors start at the 2nd column
  while (i <= ncol(ff_imdb)) {
    midformula <- paste(as.character(bestformula), colnames(ff_imdb)[i], sep = " + ")
    midmodel <- lm(as.formula(midformula), data = ff_imdb)
    midres <- summary(midmodel)$adj.r.squared
    if (midres > result) {  # remember the best candidate at this step
      result <- midres
      resformula <- midformula
      resmodel <- midmodel
    }
    i <- i + 1
  }
  if (result > bestresult) {  # keep adding variables while adjusted R^2 improves
    bestformula <- resformula
    bestresult <- result      # record the improvement so the loop can stop
    cycle <- cycle + 1
  } else {
    break
  }
}
best_imdb_fwd_model <- lm(bestformula, data = ff_imdb)
summary(best_imdb_fwd_model)

##
## Call:
## lm(formula = bestformula, data = ff_imdb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8427 -0.4093 0.0707 0.5695 2.0833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.851e+01 1.688e+01 2.873 0.004222 **
## imdb_num_votes 3.459e-06 3.537e-07 9.779 < 2e-16 ***
## genreAnimation -2.238e-01 3.497e-01 -0.640 0.522592
## genreArt House & International 9.483e-01 2.786e-01 3.404 0.000713 ***
## genreComedy -2.578e-02 1.434e-01 -0.180 0.857393
## genreDocumentary 1.020e+00 5.052e-01 2.019 0.043925 *
## genreDrama 6.829e-01 1.228e-01 5.558 4.25e-08 ***
## genreHorror -9.543e-02 2.167e-01 -0.440 0.659819
## genreMusical & Performing Arts 1.193e+00 3.186e-01 3.745 0.000200 ***
## genreMystery & Suspense 4.269e-01 1.594e-01 2.678 0.007637 **
## genreOther 4.683e-01 2.467e-01 1.898 0.058179 .
## genreScience Fiction & Fantasy -9.631e-02 3.166e-01 -0.304 0.761131
## thtr_rel_year -1.062e-02 4.774e-03 -2.225 0.026497 *
## mpaa_ratingNC-17 -1.778e-01 8.796e-01 -0.202 0.839905
## mpaa_ratingPG -5.916e-01 2.581e-01 -2.293 0.022245 *
## mpaa_ratingPG-13 -8.009e-01 2.683e-01 -2.985 0.002958 **
## mpaa_ratingR -5.131e-01 2.597e-01 -1.976 0.048673 *
## mpaa_ratingUnrated 1.471e-01 3.526e-01 0.417 0.676637
## runtime 5.503e-03 2.472e-03 2.226 0.026401 *
## best_pic_nomyes 3.846e-01 1.992e-01 1.931 0.054054 .
## dvd_rel_month 1.902e-02 1.048e-02 1.814 0.070152 .
## best_dir_winyes 2.028e-01 1.407e-01 1.441 0.150160
## dvd_rel_year -1.084e-02 1.043e-02 -1.039 0.299084
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8385 on 550 degrees of freedom
## Multiple R-squared: 0.3888, Adjusted R-squared: 0.3644
## F-statistic: 15.91 on 22 and 550 DF, p-value: < 2.2e-16
As a result, we come up with the model imdb_rating ~ imdb_num_votes + genre + thtr_rel_year + mpaa_rating + runtime + best_pic_nom + dvd_rel_month + best_dir_win + dvd_rel_year, with an adjusted R2 of 0.3644, slightly higher than that of the full model (0.3605). However, we can see that some of the variables have p-values above the 5% significance level. The position of a variable in the formula reflects its impact on the adjusted R2: the earlier it was added, the greater the improvement it contributed.
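The greedy search implemented above can be condensed into a self-contained sketch. Here it runs on the built-in mtcars data as a stand-in (the movie data frame is not reproducible here), picking predictors of mpg by adjusted R2:

```r
# Forward selection by adjusted R^2 on built-in data (illustrative only).
response   <- "mpg"
candidates <- c("wt", "hp", "qsec", "drat")
chosen     <- character(0)
best_adjr2 <- -Inf
repeat {
  remaining <- setdiff(candidates, chosen)
  if (length(remaining) == 0) break
  # adjusted R^2 of the current model plus each remaining candidate
  gains <- sapply(remaining, function(v) {
    f <- reformulate(c(chosen, v), response)
    summary(lm(f, data = mtcars))$adj.r.squared
  })
  if (max(gains) <= best_adjr2) break  # no candidate improves the fit
  chosen     <- c(chosen, names(which.max(gains)))
  best_adjr2 <- max(gains)
}
chosen      # predictors in order of inclusion (wt enters first)
best_adjr2
```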
Let's try a backward elimination technique and compare the resulting model with the model above. Backward elimination starts with the model that includes all potential predictor variables. Variables are then eliminated one at a time until the adjusted R2 can no longer be improved; at each step, we eliminate the variable whose removal leads to the largest improvement in adjusted R2.
#predictors <- model.matrix(imdb_full)[,-1] # move the names of the predictors to matrix
bwrd_bestpredictors <- c("genre", "runtime", "mpaa_rating", "thtr_rel_year", "thtr_rel_month", "dvd_rel_year", "dvd_rel_month", "imdb_num_votes", "best_pic_nom", "best_actor_win", "best_actress_win", "best_dir_win", "top200_box")
midpredictors <- bwrd_bestpredictors
bwrd_result <- summary(imdb_full)$adj.r.squared  # start from the full model
test_predictors <- c()
while (length(bwrd_bestpredictors) > 0) {
  i <- 1
  adjr2 <- c()
  while (i <= length(midpredictors)) {
    # refit the model with the i-th predictor dropped
    test_predictors <- midpredictors[-(i)]
    midformula <- as.formula(paste("imdb_rating ~ ", paste(test_predictors, collapse = " + "), sep = ""))
    midmodel <- lm(midformula, data = ff_imdb)
    midres <- summary(midmodel)$adj.r.squared
    adjr2 <- append(adjr2, midres, after = length(adjr2))
    i <- i + 1
  }
  if (max(adjr2) > bwrd_result) {  # drop the variable whose removal helps most
    midpredictors <- midpredictors[-(which.max(adjr2))]
    bwrd_bestpredictors <- midpredictors
    bwrd_result <- max(adjr2)
  } else {
    break
  }
}
bwrd_bestformula <- as.formula(paste("imdb_rating ~ ", paste(bwrd_bestpredictors, collapse = " + "), sep = ""))
imdb_bwrd_best_model <- lm(bwrd_bestformula, data = ff_imdb)
summary(imdb_bwrd_best_model)

##
## Call:
## lm(formula = bwrd_bestformula, data = ff_imdb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8427 -0.4093 0.0707 0.5695 2.0833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.851e+01 1.688e+01 2.873 0.004222 **
## genreAnimation -2.238e-01 3.497e-01 -0.640 0.522592
## genreArt House & International 9.483e-01 2.786e-01 3.404 0.000713 ***
## genreComedy -2.578e-02 1.434e-01 -0.180 0.857393
## genreDocumentary 1.020e+00 5.052e-01 2.019 0.043925 *
## genreDrama 6.829e-01 1.228e-01 5.558 4.25e-08 ***
## genreHorror -9.543e-02 2.167e-01 -0.440 0.659819
## genreMusical & Performing Arts 1.193e+00 3.186e-01 3.745 0.000200 ***
## genreMystery & Suspense 4.269e-01 1.594e-01 2.678 0.007637 **
## genreOther 4.683e-01 2.467e-01 1.898 0.058179 .
## genreScience Fiction & Fantasy -9.631e-02 3.166e-01 -0.304 0.761131
## runtime 5.503e-03 2.472e-03 2.226 0.026401 *
## mpaa_ratingNC-17 -1.778e-01 8.796e-01 -0.202 0.839905
## mpaa_ratingPG -5.916e-01 2.581e-01 -2.293 0.022245 *
## mpaa_ratingPG-13 -8.009e-01 2.683e-01 -2.985 0.002958 **
## mpaa_ratingR -5.131e-01 2.597e-01 -1.976 0.048673 *
## mpaa_ratingUnrated 1.471e-01 3.526e-01 0.417 0.676637
## thtr_rel_year -1.062e-02 4.774e-03 -2.225 0.026497 *
## dvd_rel_year -1.084e-02 1.043e-02 -1.039 0.299084
## dvd_rel_month 1.902e-02 1.048e-02 1.814 0.070152 .
## imdb_num_votes 3.459e-06 3.537e-07 9.779 < 2e-16 ***
## best_pic_nomyes 3.846e-01 1.992e-01 1.931 0.054054 .
## best_dir_winyes 2.028e-01 1.407e-01 1.441 0.150160
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8385 on 550 degrees of freedom
## Multiple R-squared: 0.3888, Adjusted R-squared: 0.3644
## F-statistic: 15.91 on 22 and 550 DF, p-value: < 2.2e-16
We ended up with the same model using the backward elimination technique. We should run diagnostics for the following predictors included in the model:
genre, runtime, mpaa_rating, thtr_rel_year, dvd_rel_year, dvd_rel_month, imdb_num_votes, best_pic_nom, best_dir_win
To assess whether the multiple regression model is reliable, we need to check the following conditions:

- each numerical predictor is linearly related to the response variable;
- the residuals of the model are nearly normal;
- the variability of the residuals is approximately constant;
- the residuals are independent.
Let’s take a look at the Normal probability plot of residuals for our model.
ggplot(data = best_imdb_fwd_model, aes(sample = .resid)) +
stat_qq(col = "orange", alpha = 0.5) +
ggtitle("Plot 14. Normal probability plot of the residuals for IMDb rating prediction model") +
  geom_abline(colour = "blue")

We can see some deviations from the normal model, but they are not extreme, and there are no outliers that might be a cause of concern. In a normal probability plot of residuals, we worry most about residuals that appear to be outliers, since these indicate long tails in the distribution of residuals.
Absolute values of residuals against fitted values. The following plot is helpful to check the condition that the variance of the residuals is approximately constant.
ggplot(data = best_imdb_fwd_model, aes(x = round(.fitted, 0), y = abs(.resid))) +
geom_jitter(col = 'orange', alpha = 0.7) +
labs(title = "Plot 15. Absolute values of residuals vs fitted for IMDb rating model", x = "Fitted values of IMDb rating (rounded to the nearest whole number)", y = "Absolute values of residuals")We don’t see any obvious deviations from constant variance in this plot. As there are not many movies with IMDb ratings around 8, 9, or 10 in the data set, we don’t know if the variance for these values remains constant. However, there is no evidence that it doesn’t.
Residuals in order of their data collection. Such a plot is helpful in identifying any connection between cases that are close to one another. If it seems that consecutive observations tend to be close to each other, this indicates the independence assumption of the observations would fail. We know that the data set represents a random selection of movies, so we don’t expect any problems here.
As expected, here we see no structure that indicates a problem.
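The code for this order plot is not shown above; a plot of this kind can be sketched as follows on a toy model (synthetic data, base graphics — the report's own model object is not reproduced here):

```r
# Sketch of a residuals-in-order-of-collection plot on synthetic data.
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- 5 + 2 * toy$x + rnorm(100)
m <- lm(y ~ x, data = toy)
plot(seq_along(resid(m)), resid(m), col = "orange",
     xlab = "Order of data collection", ylab = "Residuals",
     main = "Residuals in order of data collection")
abline(h = 0, lty = 2)  # independent residuals should scatter randomly around 0
```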
The last thing we need to check is the residuals against each predictor variable. The first row of the plots below shows the residual plots:
#ggscatmat(ff_imdb, columns = c("imdb_rating", "imdb_num_votes", "runtime", "thtr_rel_year", "dvd_rel_month"), alpha = 0.5)
ggnostic(imdb_bwrd_best_model, mapping = ggplot2::aes(color = imdb_rating)) +
labs(title = "Plot 17. Residuals of the model against each predictor variable")Let’s look at the categorical variables first. For the genre variable we see a lot of deviations from constant variability between different genres here. The variability among the groups in mpaa_rating seems a little bit more constant. For example, Unrated movies (there are only 16 of them in a data frame) have significantly less variability than others and NC-17 has no variability at all (in fact, there’s only one movie of this category in a data set). Besides that, this variable looks ok. Both best_pic_nom, best_dir_win variables show the inconstant variability (the latter in a lesser extent, though). On the one hand, this can be explained by a significantly smaller number of movies or directors that have been nominated for Oscar. However, the reals reason is not clear from the data.
The numerical variables show more uniformity overall. For the theatrical release year (thtr_rel_year) and DVD release year (dvd_rel_year) we see no structure (the distribution around 0 appears random). There might be some remaining structure in the DVD release month variable, though: we can see a slight 'wave' going up and down. There is a clear 'funneling' in both the runtime and imdb_num_votes variables. For the number of IMDb votes, predictions for movies with few votes are less accurate (larger differences between predicted and observed values) than for movies with many votes. This is not unexpected: a point estimate gets closer to the actual population parameter as the sample size increases. It is harder to explain why the model's accuracy drops for shorter movies, though.
It's worth mentioning that the following variables have p-values above the 5% significance level: best_dir_win, best_pic_nom, dvd_rel_year, and dvd_rel_month.
The table below shows the coefficients (or point estimates of the coefficients) for each predictor in our model.
summary(imdb_bwrd_best_model)$coeff %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), fixed_thead = T)

| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 48.5063482 | 16.8831814 | 2.8730573 | 0.0042222 |
| genreAnimation | -0.2237513 | 0.3497397 | -0.6397652 | 0.5225916 |
| genreArt House & International | 0.9482872 | 0.2785873 | 3.4039143 | 0.0007127 |
| genreComedy | -0.0257799 | 0.1433986 | -0.1797776 | 0.8573934 |
| genreDocumentary | 1.0201916 | 0.5051834 | 2.0194481 | 0.0439252 |
| genreDrama | 0.6828532 | 0.1228487 | 5.5584900 | 0.0000000 |
| genreHorror | -0.0954325 | 0.2166944 | -0.4404014 | 0.6598194 |
| genreMusical & Performing Arts | 1.1931570 | 0.3186205 | 3.7447588 | 0.0001996 |
| genreMystery & Suspense | 0.4268653 | 0.1594225 | 2.6775726 | 0.0076373 |
| genreOther | 0.4682855 | 0.2466858 | 1.8983072 | 0.0581788 |
| genreScience Fiction & Fantasy | -0.0963062 | 0.3166432 | -0.3041474 | 0.7611306 |
| runtime | 0.0055027 | 0.0024717 | 2.2262654 | 0.0264007 |
| mpaa_ratingNC-17 | -0.1777673 | 0.8795546 | -0.2021106 | 0.8399050 |
| mpaa_ratingPG | -0.5916361 | 0.2580573 | -2.2926539 | 0.0222446 |
| mpaa_ratingPG-13 | -0.8008963 | 0.2682718 | -2.9853911 | 0.0029582 |
| mpaa_ratingR | -0.5131497 | 0.2597135 | -1.9758299 | 0.0486734 |
| mpaa_ratingUnrated | 0.1471493 | 0.3526396 | 0.4172796 | 0.6766367 |
| thtr_rel_year | -0.0106217 | 0.0047742 | -2.2248402 | 0.0264969 |
| dvd_rel_year | -0.0108412 | 0.0104305 | -1.0393807 | 0.2990844 |
| dvd_rel_month | 0.0190193 | 0.0104821 | 1.8144594 | 0.0701517 |
| imdb_num_votes | 0.0000035 | 0.0000004 | 9.7788640 | 0.0000000 |
| best_pic_nomyes | 0.3845828 | 0.1992098 | 1.9305415 | 0.0540535 |
| best_dir_winyes | 0.2027599 | 0.1407099 | 1.4409774 | 0.1501601 |
Here, the Intercept is the model's predicted imdb_rating when all numerical predictors are zero and the categorical predictors are at their reference levels; since values such as a release year of 0 lie far outside the data, it has no meaningful interpretation on its own. The coefficient of 0.0055 for the runtime variable means that the average difference in IMDb rating for each additional minute of runtime is +0.0055 when holding the other variables constant.
We can see that some of the variables are associated with larger differences in the IMDb rating. For example, an MPAA rating of PG-13 ("Parents Strongly Cautioned: some material may be inappropriate for children under 13") is associated with an IMDb rating lower by 0.8, all else held constant. Belonging to the Musical & Performing Arts category is associated with a rating higher by 1.19, the largest single factor in the model.
The number of votes on IMDb (imdb_num_votes) has the smallest per-unit coefficient: each additional vote adds only 0.0000035 to the predicted rating, so a movie needs a million votes for its rating to rise by about 3.5. Considering that a movie in our data set receives on average only 62,861 votes, the typical contribution to the rating is modest (about 0.22 on average).
As expected, both the DVD release year and month have only a tiny association with the IMDb rating (if any, considering their p-values above the significance level). A Best Picture nomination adds 0.38 to the predicted rating.
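The vote-count arithmetic above is easy to verify directly (the per-vote coefficient below is copied from the model summary):

```r
# Per-vote coefficient from the selected model's summary output.
b_votes <- 3.459e-06
b_votes * 1e6    # roughly +3.5 rating points per million votes
b_votes * 62861  # ~0.22 for the average vote count in the data set
```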
To check our model's prediction accuracy, we picked a movie that is not in the initial data set and was released in 2016: "La La Land" (IMDb link: www.imdb.com/title/tt3783958/). As of December 17, 2019, it had an IMDb rating of 8.0 and 456,399 votes. The movie was nominated for the Best Picture Oscar, and its director, Damien Chazelle, won the Best Director Oscar in 2017. The rest of the parameters have been combined in a data frame lalaland.
We use the predict function with our model and the lalaland data frame as inputs:
lalaland <- data.frame(genre = "Comedy", runtime = 128, mpaa_rating = "PG-13", thtr_rel_year = 2016, dvd_rel_year = 2017, dvd_rel_month = 4, imdb_num_votes = 456399, best_pic_nom = "yes", best_dir_win = "yes")
lalaland_prediction <- predict(imdb_bwrd_best_model, lalaland, interval = "prediction", level = 0.95)
lalaland_prediction

##        fit      lwr      upr
## 1 7.345815 5.621457 9.070173
The model predicts an IMDb rating of 7.35 for this movie; the actual rating is 8.0, so the model underestimated it. The lwr and upr values show the lower and upper bounds of a 95% prediction interval: we are 95% confident that the actual IMDb rating of a movie with these characteristics falls between 5.62 and 9.07. The interval is quite wide, which agrees with the adjusted R2 value of 0.3644, meaning that only 36.44% of the variation in the rating is explained by the model.
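As a sanity check, the point estimate can be reassembled by hand from the printed coefficients (using the extra digits shown in the model summary for imdb_num_votes; the result matches predict() only up to coefficient rounding):

```r
# Linear predictor for "La La Land", assembled from the printed coefficients.
fit <- 48.5063482 +              # intercept
  (-0.0257799) +                 # genre: Comedy
  0.0055027 * 128 +              # runtime
  (-0.8008963) +                 # mpaa_rating: PG-13
  (-0.0106217) * 2016 +          # thtr_rel_year
  (-0.0108412) * 2017 +          # dvd_rel_year
  0.0190193 * 4 +                # dvd_rel_month
  3.459e-06 * 456399 +           # imdb_num_votes
  0.3845828 +                    # best_pic_nom: yes
  0.2027599                      # best_dir_win: yes
round(fit, 2)  # ~7.35, matching predict()'s point estimate
```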
We use a movie released in 2016 even though our model is based on movies released before 2016. This implies extrapolation, which makes the prediction even less reliable.
In this research, we used factual data on 600+ movies and built a multiple regression model that predicts the IMDb rating. Adjusted R² was used as the estimate of explained variance. We tried two stepwise model selection techniques (forward selection and backward elimination), which converged on the same model, explaining approximately 36.44% of the variation in movie popularity as measured by IMDb rating.
Answering the research question: we can predict the popularity of a movie without watching it, but only to a limited extent.
Fortunately for us movie lovers, the popularity of a piece of art rests on something that is hard to measure and categorize, and this "something" explains a good part of the remaining 63.56%.
# The following two commands remove any previously installed H2O packages for R.
# if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
# if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
#
# # Next, we download packages that H2O depends on.
# pkgs <- c("RCurl","jsonlite")
# for (pkg in pkgs) {
# if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
# }
#
# # Now we download, install and initialize the H2O package for R.
# install.packages("h2o", type="source", repos="http://h2o-release.s3.amazonaws.com/h2o/rel-yu/1/R")
# Finally, let's load H2O and start up an H2O cluster
library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 6 days 22 hours
## H2O cluster timezone: America/Phoenix
## H2O data parsing timezone: UTC
## H2O cluster version: 3.28.0.1
## H2O cluster version age: 1 month and 8 days
## H2O cluster name: H2O_started_from_R_vk_jet_kvi422
## H2O cluster total nodes: 1
## H2O cluster total memory: 0.83 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 3.6.2 (2019-12-12)
# Tweak the data set: round the imdb_rating column to keep the confusion matrix small
# (instead of separate rows for 8.0, 8.1, 8.2, 8.3, and 8.4 we get a single row for 8).
# Hopefully the prediction model will also improve.
ff_imdb_rnd <- ff_imdb %>%
mutate(imdb_rating_rounded = round(imdb_rating)) %>%
select(imdb_rating, imdb_rating_rounded, everything())
ff_imdb_h2o <- as.h2o(ff_imdb_rnd)
ff_imdb_h2o[,2] <- as.factor(ff_imdb_h2o[,2])
parts <- h2o.splitFrame(ff_imdb_h2o, 0.8)
train <- parts[[1]]
test <- parts[[2]]
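h2o.splitFrame() draws a random ~80/20 split of the rows. For readers unfamiliar with it, a base-R sketch of the same idea (illustrated on the built-in mtcars data, not on our H2O frame) might look like:

```r
# A base-R analogue of h2o.splitFrame(frame, 0.8): random 80/20 row split
set.seed(42)  # fix the RNG seed for reproducibility
n_train  <- floor(0.8 * nrow(mtcars))
idx      <- sample(seq_len(nrow(mtcars)), size = n_train)
train_df <- mtcars[idx, ]    # 25 of the 32 rows
test_df  <- mtcars[-idx, ]   # the remaining 7 rows
```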
summary(train)
## Warning in summary.H2OFrame(train): Approximated quantiles computed! If you are
## interested in exact quantiles, please pass the `exact_quantiles=TRUE` parameter.
## imdb_rating imdb_rating_rounded genre runtime
## Min. :2.300 6:181 Drama :235 Min. : 68
## 1st Qu.:5.871 7:145 Comedy : 65 1st Qu.: 94
## Median :6.497 8: 61 Action & Adventure: 48 Median :105
## Mean :6.394 5: 36 Mystery & Suspense: 39 Mean :107
## 3rd Qu.:7.099 4: 20 Horror : 17 3rd Qu.:118
## Max. :8.500 3: 3 Other : 14 Max. :194
## mpaa_rating thtr_rel_year thtr_rel_month dvd_rel_year dvd_rel_month
## R :233 Min. :1972 Min. : 1.000 Min. :1991 Min. : 1.000
## PG-13 :105 1st Qu.:1991 1st Qu.: 4.000 1st Qu.:2001 1st Qu.: 3.000
## PG : 84 Median :2000 Median : 6.000 Median :2003 Median : 6.000
## Unrated: 15 Mean :1998 Mean : 6.616 Mean :2004 Mean : 6.297
## G : 11 3rd Qu.:2006 3rd Qu.:10.000 3rd Qu.:2007 3rd Qu.: 9.000
## NC-17 : 0 Max. :2014 Max. :12.000 Max. :2015 Max. :12.000
## imdb_num_votes best_pic_nom best_actor_win best_actress_win best_dir_win
## Min. : 390 no :431 no :383 no :392 no :415
## 1st Qu.: 5746 yes: 17 yes: 65 yes: 56 yes: 33
## Median : 18242
## Mean : 63912
## 3rd Qu.: 63989
## Max. :893008
## top200_box
## no :434
## yes: 14
##
##
##
##
## [1] 448
## [1] 125
## Model Details:
## ==============
##
## H2OMultinomialModel: deeplearning
## Model ID: DeepLearning_model_R_1579320800507_105
## Status of Neuron Layers: predicting imdb_rating_rounded, 8-class classification, multinomial distribution, CrossEntropy loss, 50,008 weights/biases, 596.1 KB, 4,480 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms momentum
## 1 1 40 Input 0.00 % NA NA NA NA NA
## 2 2 200 Rectifier 0.00 % 0.000000 0.000000 0.203462 0.403377 0.000000
## 3 3 200 Rectifier 0.00 % 0.000000 0.000000 0.005006 0.002772 0.000000
## 4 4 8 Softmax NA 0.000000 0.000000 0.134806 0.339789 0.000000
## mean_weight weight_rms mean_bias bias_rms
## 1 NA NA NA NA
## 2 -0.008759 0.094928 0.413450 0.036820
## 3 -0.005816 0.071364 0.981269 0.020033
## 4 -0.008355 0.392839 -0.012372 0.007795
##
##
## H2OMultinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
##
## Training Set Metrics:
## =====================
##
## Extract training frame with `h2o.getFrame("RTMP_sid_9d06_6")`
## MSE: (Extract with `h2o.mse`) 0.3082136
## RMSE: (Extract with `h2o.rmse`) 0.5551699
## Logloss: (Extract with `h2o.logloss`) 0.9386761
## Mean Per-Class Error: 0.4638141
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## 2 3 4 5 6 7 8 9 Error Rate
## 2 1 0 0 0 1 0 0 0 0.5000 = 1 / 2
## 3 0 0 0 1 1 1 0 0 1.0000 = 3 / 3
## 4 0 0 0 8 11 1 0 0 1.0000 = 20 / 20
## 5 0 0 0 28 6 2 0 0 0.2222 = 8 / 36
## 6 0 0 0 15 139 19 8 0 0.2320 = 42 / 181
## 7 0 0 0 8 44 71 22 0 0.5103 = 74 / 145
## 8 0 0 0 2 7 6 46 0 0.2459 = 15 / 61
## 9 0 0 0 0 0 0 0 0 NA = 0 / 0
## Totals 1 0 0 62 209 100 76 0 0.3638 = 163 / 448
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-8 Hit Ratios:
## k hit_ratio
## 1 1 0.636161
## 2 2 0.848214
## 3 3 0.957589
## 4 4 0.984375
## 5 5 0.997768
## 6 6 1.000000
## 7 7 1.000000
## 8 8 1.000000
h2o.confusionMatrix(mDL) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), fixed_thead = T)

|        | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Error | Rate |
|--------|---|---|---|---|---|---|---|---|-------|------|
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.5000000 | 1 / 2 |
| 3 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1.0000000 | 3 / 3 |
| 4 | 0 | 0 | 0 | 8 | 11 | 1 | 0 | 0 | 1.0000000 | 20 / 20 |
| 5 | 0 | 0 | 0 | 28 | 6 | 2 | 0 | 0 | 0.2222222 | 8 / 36 |
| 6 | 0 | 0 | 0 | 15 | 139 | 19 | 8 | 0 | 0.2320442 | 42 / 181 |
| 7 | 0 | 0 | 0 | 8 | 44 | 71 | 22 | 0 | 0.5103448 | 74 / 145 |
| 8 | 0 | 0 | 0 | 2 | 7 | 6 | 46 | 0 | 0.2459016 | 15 / 61 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 0 / 0 |
| Totals | 1 | 0 | 0 | 62 | 209 | 100 | 76 | 0 | 0.3638393 | 163 / 448 |
## H2OMultinomialMetrics: deeplearning
##
## Test Set Metrics:
## =====================
##
## MSE: (Extract with `h2o.mse`) 0.4921508
## RMSE: (Extract with `h2o.rmse`) 0.7015346
## Logloss: (Extract with `h2o.logloss`) 1.909072
## Mean Per-Class Error: 0.809222
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## 2 3 4 5 6 7 8 9 Error Rate
## 2 0 0 0 0 1 0 0 0 1.0000 = 1 / 1
## 3 0 0 0 1 2 0 0 0 1.0000 = 3 / 3
## 4 0 0 0 2 5 0 0 0 1.0000 = 7 / 7
## 5 0 0 0 3 9 1 0 0 0.7692 = 10 / 13
## 6 0 0 0 9 19 8 2 0 0.5000 = 19 / 38
## 7 0 0 0 1 19 10 10 0 0.7500 = 30 / 40
## 8 0 0 0 1 4 5 12 0 0.4545 = 10 / 22
## 9 0 0 0 0 0 0 1 0 1.0000 = 1 / 1
## Totals 0 0 0 17 59 24 25 0 0.6480 = 81 / 125
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-8 Hit Ratios:
## k hit_ratio
## 1 1 0.352000
## 2 2 0.696000
## 3 3 0.848000
## 4 4 0.944000
## 5 5 0.968000
## 6 6 0.984000
## 7 7 0.992000
## 8 8 1.000000
## model_id mean_per_class_error
## 1 DeepLearning_grid__1_AutoML_20200124_195055_model_1 0.6382081
## 2 GBM_4_AutoML_20200124_195055 0.6435266
## 3 DeepLearning_grid__1_AutoML_20200124_195055_model_3 0.6462788
## 4 GBM_grid__1_AutoML_20200124_195055_model_23 0.6514436
## 5 DeepLearning_grid__2_AutoML_20200124_195055_model_1 0.6527988
## 6 GBM_3_AutoML_20200124_195055 0.6565103
## 7 GBM_grid__1_AutoML_20200124_195055_model_17 0.6584141
## 8 GBM_grid__1_AutoML_20200124_195055_model_18 0.6594116
## 9 GBM_2_AutoML_20200124_195055 0.6595320
## 10 DRF_1_AutoML_20200124_195055 0.6620231
## 11 GBM_grid__1_AutoML_20200124_195055_model_16 0.6638960
## 12 XRT_1_AutoML_20200124_195055 0.6656215
## 13 GBM_grid__1_AutoML_20200124_195055_model_26 0.6669583
## 14 GBM_grid__1_AutoML_20200124_195055_model_2 0.6675738
## 15 XGBoost_grid__1_AutoML_20200124_195055_model_5 0.6679752
## 16 GBM_grid__1_AutoML_20200124_195055_model_21 0.6684557
## 17 XGBoost_grid__1_AutoML_20200124_195055_model_4 0.6690985
## 18 DeepLearning_grid__1_AutoML_20200124_195055_model_2 0.6693460
## 19 XGBoost_grid__1_AutoML_20200124_195055_model_7 0.6704557
## 20 DeepLearning_grid__1_AutoML_20200124_195055_model_4 0.6714398
## 21 GBM_grid__1_AutoML_20200124_195055_model_10 0.6734082
## 22 GBM_grid__1_AutoML_20200124_195055_model_4 0.6739245
## 23 XGBoost_1_AutoML_20200124_195055 0.6746378
## 24 GBM_1_AutoML_20200124_195055 0.6758915
## 25 DeepLearning_grid__3_AutoML_20200124_195055_model_1 0.6797380
## 26 GBM_grid__1_AutoML_20200124_195055_model_28 0.6799795
## 27 GBM_grid__1_AutoML_20200124_195055_model_5 0.6811640
## 28 XGBoost_3_AutoML_20200124_195055 0.6823559
## 29 GBM_grid__1_AutoML_20200124_195055_model_15 0.6825816
## 30 DeepLearning_grid__3_AutoML_20200124_195055_model_2 0.6841817
## 31 GBM_grid__1_AutoML_20200124_195055_model_19 0.6848488
## 32 GBM_grid__1_AutoML_20200124_195055_model_8 0.6887564
## 33 XGBoost_grid__1_AutoML_20200124_195055_model_2 0.6915332
## 34 GBM_grid__1_AutoML_20200124_195055_model_11 0.6922114
## 35 GBM_grid__1_AutoML_20200124_195055_model_29 0.6923116
## 36 DeepLearning_grid__2_AutoML_20200124_195055_model_2 0.6924489
## 37 XGBoost_2_AutoML_20200124_195055 0.6963317
## 38 GBM_grid__1_AutoML_20200124_195055_model_24 0.7002770
## 39 GBM_grid__1_AutoML_20200124_195055_model_13 0.7005028
## 40 DeepLearning_grid__2_AutoML_20200124_195055_model_3 0.7026830
## 41 GBM_grid__1_AutoML_20200124_195055_model_20 0.7032008
## 42 XGBoost_grid__1_AutoML_20200124_195055_model_3 0.7065955
## 43 DeepLearning_1_AutoML_20200124_195055 0.7095069
## 44 StackedEnsemble_AllModels_AutoML_20200124_195055 0.7112976
## 45 GBM_grid__1_AutoML_20200124_195055_model_22 0.7138926
## 46 XGBoost_grid__1_AutoML_20200124_195055_model_6 0.7192156
## 47 GBM_grid__1_AutoML_20200124_195055_model_6 0.7195311
## 48 GBM_grid__1_AutoML_20200124_195055_model_12 0.7215755
## 49 GBM_5_AutoML_20200124_195055 0.7216125
## 50 GBM_grid__1_AutoML_20200124_195055_model_27 0.7226234
## 51 GBM_grid__1_AutoML_20200124_195055_model_14 0.7235844
## 52 GBM_grid__1_AutoML_20200124_195055_model_9 0.7243380
## 53 GBM_grid__1_AutoML_20200124_195055_model_1 0.7245368
## 54 DeepLearning_grid__1_AutoML_20200124_195055_model_5 0.7247893
## 55 GBM_grid__1_AutoML_20200124_195055_model_3 0.7259896
## 56 GBM_grid__1_AutoML_20200124_195055_model_7 0.7300343
## 57 XGBoost_grid__1_AutoML_20200124_195055_model_1 0.7327777
## 58 DeepLearning_grid__3_AutoML_20200124_195055_model_3 0.7358188
## 59 StackedEnsemble_BestOfFamily_AutoML_20200124_195055 0.7500000
## 60 GLM_1_AutoML_20200124_195055 0.7500000
## 61 GBM_grid__1_AutoML_20200124_195055_model_25 0.7520922
## logloss rmse mse
## 1 2.874357 0.6821987 0.4653951
## 2 1.469201 0.6700009 0.4489012
## 3 1.565344 0.6396374 0.4091360
## 4 1.437569 0.6806231 0.4632477
## 5 3.359979 0.6843742 0.4683681
## 6 1.471811 0.6751921 0.4558844
## 7 22.859580 0.8145799 0.6635405
## 8 2.029109 0.8684317 0.7541736
## 9 1.509593 0.6783006 0.4600917
## 10 2.306275 0.6653441 0.4426828
## 11 1.485650 0.7155002 0.5119405
## 12 1.648765 0.6677945 0.4459494
## 13 1.771981 0.8248887 0.6804414
## 14 1.864230 0.6709628 0.4501910
## 15 1.388091 0.7085196 0.5020000
## 16 1.513849 0.6852810 0.4696101
## 17 1.372579 0.7008606 0.4912056
## 18 1.820122 0.6687782 0.4472642
## 19 1.371367 0.7003057 0.4904280
## 20 1.419923 0.6516111 0.4245970
## 21 1.794830 0.8300555 0.6889921
## 22 1.785640 0.8273814 0.6845600
## 23 1.329911 0.6860047 0.4706024
## 24 1.588559 0.6871091 0.4721189
## 25 1.431426 0.6407995 0.4106240
## 26 1.793407 0.8295642 0.6881767
## 27 1.831146 0.8364072 0.6995771
## 28 1.361331 0.7055882 0.4978547
## 29 1.879541 0.8456149 0.7150646
## 30 1.396300 0.6541300 0.4278861
## 31 1.790738 0.8290682 0.6873542
## 32 1.766026 0.8248022 0.6802987
## 33 1.434423 0.7365913 0.5425667
## 34 7.673091 0.7475792 0.5588746
## 35 1.438542 0.7306918 0.5339105
## 36 1.291925 0.6589075 0.4341590
## 37 1.387477 0.7210912 0.5199725
## 38 3.187138 0.7131369 0.5085642
## 39 1.472999 0.6875668 0.4727481
## 40 1.712796 0.6743905 0.4548026
## 41 1.813074 0.8339442 0.6954629
## 42 1.398919 0.7258960 0.5269250
## 43 1.343313 0.6800739 0.4625005
## 44 1.340312 0.6886199 0.4741974
## 45 1.807818 0.8312863 0.6910369
## 46 1.398088 0.7227731 0.5224009
## 47 1.902870 0.8496652 0.7219309
## 48 2.040000 0.8699202 0.7567612
## 49 1.403047 0.7152745 0.5116176
## 50 1.857853 0.8421254 0.7091753
## 51 1.777875 0.8273437 0.6844977
## 52 1.837808 0.8384675 0.7030278
## 53 1.414065 0.7220073 0.5212945
## 54 1.497135 0.6713075 0.4506537
## 55 1.905398 0.8499961 0.7224934
## 56 1.541743 0.7683763 0.5904021
## 57 1.632353 0.7944841 0.6312049
## 58 1.399945 0.7021186 0.4929705
## 59 1.411271 0.7174727 0.5147671
## 60 1.459866 0.7170132 0.5141079
## 61 20.216931 0.8216195 0.6750585
##
## [61 rows x 5 columns]
## H2OMultinomialMetrics: deeplearning
##
## Test Set Metrics:
## =====================
##
## MSE: (Extract with `h2o.mse`) 0.5255961
## RMSE: (Extract with `h2o.rmse`) 0.7249801
## Logloss: (Extract with `h2o.logloss`) 4.462709
## Mean Per-Class Error: 0.7976333
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## 2 3 4 5 6 7 8 9 Error Rate
## 2 0 0 0 0 1 0 0 0 1.0000 = 1 / 1
## 3 0 0 0 0 3 0 0 0 1.0000 = 3 / 3
## 4 0 0 1 1 5 0 0 0 0.8571 = 6 / 7
## 5 0 0 0 0 10 3 0 0 1.0000 = 13 / 13
## 6 0 1 0 1 25 8 3 0 0.3421 = 13 / 38
## 7 0 0 0 0 18 20 2 0 0.5000 = 20 / 40
## 8 0 0 0 0 4 11 7 0 0.6818 = 15 / 22
## 9 0 0 0 0 0 0 1 0 1.0000 = 1 / 1
## Totals 0 1 1 2 66 42 13 0 0.5760 = 72 / 125
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-8 Hit Ratios:
## k hit_ratio
## 1 1 0.424000
## 2 2 0.688000
## 3 3 0.880000
## 4 4 0.936000
## 5 5 0.976000
## 6 6 0.992000
## 7 7 1.000000
## 8 8 1.000000
Here we try to obtain a better model by training on all variables (except those related to Rotten Tomatoes), without the manual exclusions we applied in the linear regression analysis.
movies_edited <- movies %>%
mutate(imdb_rating_rounded = round(imdb_rating)) %>%
select(imdb_rating, critics_rating, critics_score, audience_rating, audience_score, everything())
movies_h2o <- as.h2o(movies_edited)
movies_h2o[,33] <- as.factor(movies_h2o[,33])
movies_h2o[,1] <- as.factor(movies_h2o[,1])
parts1 <- h2o.splitFrame(movies_h2o, 0.7)
train1 <- parts1[[1]]
test1 <- parts1[[2]]
aML_movies <- h2o.automl(6:32, 33, train1, max_runtime_secs = 180, nfolds = 15)
## model_id mean_per_class_error
## 1 XGBoost_3_AutoML_20200124_195958 0.7886034
## 2 XGBoost_grid__1_AutoML_20200124_195958_model_2 0.7917549
## 3 XGBoost_1_AutoML_20200124_195958 0.7924184
## 4 DeepLearning_1_AutoML_20200124_195958 0.8036197
## 5 XGBoost_grid__1_AutoML_20200124_195958_model_3 0.8065900
## 6 XGBoost_grid__1_AutoML_20200124_195958_model_1 0.8096159
## 7 GBM_grid__1_AutoML_20200124_195958_model_5 0.8099997
## 8 DeepLearning_grid__1_AutoML_20200124_195958_model_1 0.8107425
## 9 XGBoost_grid__1_AutoML_20200124_195958_model_5 0.8138672
## 10 XRT_1_AutoML_20200124_195958 0.8139340
## 11 GBM_1_AutoML_20200124_195958 0.8159317
## 12 GBM_2_AutoML_20200124_195958 0.8258173
## 13 GBM_3_AutoML_20200124_195958 0.8269731
## 14 GBM_4_AutoML_20200124_195958 0.8295011
## 15 DRF_1_AutoML_20200124_195958 0.8340427
## 16 XGBoost_2_AutoML_20200124_195958 0.8343168
## 17 GBM_grid__1_AutoML_20200124_195958_model_4 0.8387184
## 18 GBM_grid__1_AutoML_20200124_195958_model_1 0.8433453
## 19 GBM_grid__1_AutoML_20200124_195958_model_3 0.8434071
## 20 GBM_grid__1_AutoML_20200124_195958_model_2 0.8466630
## 21 XGBoost_grid__1_AutoML_20200124_195958_model_4 0.8484848
## 22 GBM_5_AutoML_20200124_195958 0.8592283
## 23 GLM_1_AutoML_20200124_195958 0.8750000
## 24 StackedEnsemble_AllModels_AutoML_20200124_195958 0.8750000
## 25 StackedEnsemble_BestOfFamily_AutoML_20200124_195958 0.8750000
## logloss rmse mse
## 1 1.386488 0.7127303 0.5079845
## 2 1.372170 0.7064156 0.4990229
## 3 1.344066 0.6918348 0.4786353
## 4 1.389324 0.6802779 0.4627780
## 5 1.414370 0.7251060 0.5257788
## 6 1.418984 0.7133923 0.5089285
## 7 2.107188 0.7266780 0.5280609
## 8 1.751122 0.7025074 0.4935166
## 9 1.466230 0.7377222 0.5442340
## 10 1.864578 0.6796911 0.4619799
## 11 1.808830 0.7335619 0.5381130
## 12 1.836281 0.7368593 0.5429616
## 13 1.840641 0.7429918 0.5520368
## 14 1.852366 0.7343269 0.5392360
## 15 3.498861 0.6991648 0.4888314
## 16 1.469833 0.7407563 0.5487199
## 17 1.937663 0.8500078 0.7225132
## 18 13.360339 0.7987343 0.6379766
## 19 1.869626 0.8401822 0.7059061
## 20 1.833578 0.7584323 0.5752195
## 21 1.709908 0.8122720 0.6597858
## 22 1.750916 0.7596533 0.5770732
## 23 1.500676 0.7341346 0.5389537
## 24 1.467663 0.7344108 0.5393593
## 25 1.464898 0.7341335 0.5389519
##
## [25 rows x 5 columns]
## H2OMultinomialMetrics: xgboost
##
## Test Set Metrics:
## =====================
##
## MSE: (Extract with `h2o.mse`) 0.4869021
## RMSE: (Extract with `h2o.rmse`) 0.6977837
## Logloss: (Extract with `h2o.logloss`) 1.313213
## Mean Per-Class Error: 0.6422532
## R^2: (Extract with `h2o.r2`) 0.6475947
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## 2 3 4 5 6 7 8 9 Error Rate
## 2 0 0 0 0 1 0 0 0 1.0000 = 1 / 1
## 3 0 0 0 0 2 0 0 0 1.0000 = 2 / 2
## 4 0 0 0 0 9 1 0 0 1.0000 = 10 / 10
## 5 0 0 0 2 17 1 0 0 0.9000 = 18 / 20
## 6 0 0 0 0 51 11 0 0 0.1774 = 11 / 62
## 7 0 0 0 0 35 26 5 0 0.6061 = 40 / 66
## 8 0 0 0 0 8 12 24 0 0.4545 = 20 / 44
## 9 0 0 0 0 0 0 0 0 NA = 0 / 0
## Totals 0 0 0 2 123 51 29 0 0.4976 = 102 / 205
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-8 Hit Ratios:
## k hit_ratio
## 1 1 0.502439
## 2 2 0.824390
## 3 3 0.912195
## 4 4 0.946341
## 5 5 0.985366
## 6 6 1.000000
## 7 7 1.000000
## 8 8 1.000000
We ran several ML experiments, and the most interesting result so far is that studio is the second most important variable explaining movie popularity on IMDb. In our linear regression analysis we excluded this variable because it has over 100 distinct values, so fitting the regression would have taken too long; with machine learning this is not an issue.
The Gradient Boosting Machine (GBM) model returns the most impressive results on the training data set, with zero errors. On the test data, however, it errs in roughly 50% of cases. It seems we are dealing with overfitting here (or our interpretation of the output is incorrect); further tests are needed.
Let’s build a standalone GBM model
gbm <- h2o.gbm(6:32, 33, training_frame = train1, ntrees = 300,
## sample 80% of rows per tree
sample_rate = 0.8,
## sample 80% of columns per split
col_sample_rate = 0.8,
## fix a random number generator seed for reproducibility
seed = 1234,
## let's try to use 'stumps' instead of 'bushy' trees
               max_depth = 1)
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping bad and constant columns: [actor4, actor3, actor2, actor1, director, rt_url, imdb_url, title, actor5].
## Model Details:
## ==============
##
## H2OMultinomialModel: gbm
## Model ID: GBM_model_R_1579320800507_106
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 300 2400 219243 1
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 1 1.00000 2 2 2.00000
##
##
## H2OMultinomialMetrics: gbm
## ** Reported on training data. **
##
## Training Set Metrics:
## =====================
##
## Extract training frame with `h2o.getFrame("RTMP_sid_9d06_351")`
## MSE: (Extract with `h2o.mse`) 0.1201133
## RMSE: (Extract with `h2o.rmse`) 0.3465736
## Logloss: (Extract with `h2o.logloss`) 0.3736369
## Mean Per-Class Error: 0.07918575
## R^2: (Extract with `h2o.r2`) 0.9055787
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## 2 3 4 5 6 7 8 9 Error Rate
## 2 2 0 0 0 0 0 0 0 0.0000 = 0 / 2
## 3 0 5 0 0 0 0 0 0 0.0000 = 0 / 5
## 4 0 0 15 1 4 0 0 0 0.2500 = 5 / 20
## 5 0 0 0 28 3 0 0 0 0.0968 = 3 / 31
## 6 0 0 1 0 151 13 0 0 0.0848 = 14 / 165
## 7 0 0 0 1 14 120 3 0 0.1304 = 18 / 138
## 8 0 0 0 0 4 2 78 0 0.0714 = 6 / 84
## 9 0 0 0 0 0 0 0 1 0.0000 = 0 / 1
## Totals 2 5 16 30 176 135 81 1 0.1031 = 46 / 446
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-8 Hit Ratios:
## k hit_ratio
## 1 1 0.896861
## 2 2 0.995516
## 3 3 1.000000
## 4 4 1.000000
## 5 5 1.000000
## 6 6 1.000000
## 7 7 1.000000
## 8 8 1.000000
## H2OMultinomialMetrics: gbm
##
## Test Set Metrics:
## =====================
##
## MSE: (Extract with `h2o.mse`) 0.4604392
## RMSE: (Extract with `h2o.rmse`) 0.6785567
## Logloss: (Extract with `h2o.logloss`) 1.762086
## Mean Per-Class Error: 0.6648888
## R^2: (Extract with `h2o.r2`) 0.6667478
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## 2 3 4 5 6 7 8 9 Error Rate
## 2 0 0 0 0 1 0 0 0 1.0000 = 1 / 1
## 3 0 0 1 1 0 0 0 0 1.0000 = 2 / 2
## 4 0 0 0 0 10 0 0 0 1.0000 = 10 / 10
## 5 0 0 2 2 13 2 1 0 0.9000 = 18 / 20
## 6 0 0 2 3 44 12 1 0 0.2903 = 18 / 62
## 7 0 0 0 1 31 29 5 0 0.5606 = 37 / 66
## 8 0 0 1 0 6 18 19 0 0.5682 = 25 / 44
## 9 0 0 0 0 0 0 0 0 NA = 0 / 0
## Totals 0 0 6 7 105 61 26 0 0.5415 = 111 / 205
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-8 Hit Ratios:
## k hit_ratio
## 1 1 0.458537
## 2 2 0.721951
## 3 3 0.848781
## 4 4 0.931707
## 5 5 0.980488
## 6 6 0.995122
## 7 7 1.000000
## 8 8 1.000000
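The overfitting suspicion can be quantified from the two confusion matrices above: the training error is far below the test error.

```r
# Generalization gap of the standalone GBM (totals taken from the output above)
train_error <- 46 / 446    # ~10.3% error on the training frame
test_error  <- 111 / 205   # ~54.2% error on the held-out test frame
test_error - train_error   # a gap of ~0.44 strongly suggests overfitting
```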
We ran numerous experiments with GBM, and the best model we found has an error rate of approximately 43.75% on the test data set; it used at most 58 trees with a tree depth of 1 (stumps). However, re-splitting the data made this model much less effective, with around 66% errors on the new test data, so we had to readjust the number of trees to bring the error rate back down.
One thing worth mentioning: the majority of models we tested did a pretty good job in terms of deviation from the true value. In most cases, the predicted value is off by just one point (the absolute residual equals 1): if the true value is 7, the model predicts 6 or 8, not 3 or 9. So it might be better to report a prediction band of ±1 point for the predictions instead of counting every one-point miss as an error. Further research is needed.
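A minimal sketch of that "±1 point" accuracy measure, using hypothetical actual/predicted values (illustrative only, not taken from any of the models above):

```r
# Share of predictions within one point of the truth (values are illustrative)
actual    <- c(7, 6, 8, 5, 7, 6, 7, 8)
predicted <- c(6, 6, 7, 6, 9, 6, 7, 7)
mean(abs(actual - predicted) <= 1)   # 7 of 8 predictions are within ±1
```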